Heart disease remains the leading cause of death globally, accounting for millions of deaths annually. The ability to accurately predict heart disease can lead to early diagnosis and intervention, significantly reducing the mortality rate associated with this condition. This project aims to leverage data science and machine learning techniques to analyze and predict the likelihood of heart attacks based on a wide range of factors including age, sex, chest pain type, resting blood pressure, serum cholesterol, and more.
Utilizing a dataset from Kaggle, this analysis will explore various features that may influence heart disease, implement several machine learning models to predict heart disease occurrence, and evaluate the performance of these models. The dataset includes diverse variables, offering a comprehensive insight into the factors that might contribute to heart disease.
Through this project, we aim to uncover significant predictors of heart disease, assess the predictive power of machine learning models in a medical context, and ultimately, contribute to the efforts of preventing heart disease by enabling early detection. This endeavor is not only a technical challenge but also a crucial step towards saving lives and improving health outcomes.
As we delve into the data and models, our goal is to present a clear, concise, and informative analysis that can serve as a foundation for further research and practical applications in the field of medical science and public health.
The heart is like an efficient pump, beating about 60 to 80 times a minute at rest to circulate blood throughout the body. Just like any other part of the body, the heart itself needs a steady supply of nutrients and oxygen, which it gets through its own set of blood vessels, the coronary arteries.
Sometimes, these crucial arteries can have trouble with their blood flow due to blockages or narrowing, a condition known as coronary insufficiency. This issue can vary widely – it all depends on where the blockage is, how severe it is, and the specific arteries affected.
For some, this might just mean chest pain that pops up during a workout or some physical task, but goes away once they take a break. However, in more serious cases, if a coronary artery gets suddenly completely blocked, it can trigger a heart attack. This often starts with intense chest pain and, in the worst-case scenario, can lead to sudden death.
Variable definitions in the Dataset
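Variable descriptions for reference. In the file loaded below, several columns are named differently: trestbps appears as trtbps, thalach as thalachh, exang as exng, slope as slp, ca as caa, thal as thall, and num as output.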
age - age in years
sex - sex (1 = male; 0 = female)
cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 0 = asymptomatic)
trestbps - resting blood pressure (in mm Hg on admission to the hospital)
chol - serum cholesterol in mg/dl
fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg - resting electrocardiographic results (1 = normal; 2 = having ST-T wave abnormality; 0 = hypertrophy)
thalach - maximum heart rate achieved
exang - exercise induced angina (1 = yes; 0 = no)
oldpeak - ST depression induced by exercise relative to rest
slope - the slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 = downsloping)
ca - number of major vessels (0-3) colored by fluoroscopy
thal - thallium stress test result (2 = normal; 1 = fixed defect; 3 = reversible defect)
num - the predicted attribute - diagnosis of heart disease (angiographic disease status) (Value 0 = < 50% diameter narrowing; Value 1 = > 50% diameter narrowing)
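The coded categories above are easier to read in tables and plots once they are converted to labelled factors. The following is a minimal sketch, assuming a small made-up data frame `toy` (not the real data) and using only the codes listed above for sex and cp; the `_label` column names are illustrative.

```r
# Toy rows for illustration only; these are not taken from the real dataset.
toy <- data.frame(
  sex = c(1, 0, 1, 0),
  cp  = c(3, 2, 1, 0)
)

# Map the numeric codes to the labels given in the variable definitions.
toy$sex_label <- factor(toy$sex, levels = c(0, 1), labels = c("female", "male"))
toy$cp_label <- factor(
  toy$cp,
  levels = c(0, 1, 2, 3),
  labels = c("asymptomatic", "typical angina", "atypical angina", "non-anginal pain")
)

print(toy)
```

Any code that is not listed in `levels` becomes `NA`, which makes undocumented values easy to spot.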
library(corrplot) # for the correlation plot
library(discrim) # parsnip extension for discriminant analysis models
library(corrr) # for calculating correlations
library(knitr) # for knitting the report
library(MASS) # provides lda() and qda() for discriminant analysis
library(tidyverse) # data wrangling and plotting (includes ggplot2 and dplyr)
library(tidymodels) # modeling framework (includes yardstick)
library(ggplot2) # for most of our visualizations
library(ggrepel) # for non-overlapping text labels in ggplot2
library(rpart.plot) # for visualizing trees
library(vip) # for variable importance
library(janitor) # for cleaning column names and data
library(ranger) # for building our random forest
library(dplyr) # for data manipulation
library(yardstick) # for model performance metrics
library(naniar) # for visualizing missing data (vis_miss)
tidymodels_prefer() # resolve common function-name conflicts in favor of tidymodels
Heart_data <- read_csv("data/heart.csv", show_col_types = FALSE)
Heart_data
## # A tibble: 303 × 14
## age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63 1 3 145 233 1 0 150 0 2.3 0
## 2 37 1 2 130 250 0 1 187 0 3.5 0
## 3 41 0 1 130 204 0 0 172 0 1.4 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1
## 7 56 0 1 140 294 0 0 153 0 1.3 1
## 8 44 1 1 120 263 0 1 173 0 0 2
## 9 52 1 2 172 199 1 1 162 0 0.5 2
## 10 57 1 2 150 168 0 1 174 0 1.6 2
## # ℹ 293 more rows
## # ℹ 3 more variables: caa <dbl>, thall <dbl>, output <dbl>
vis_miss(Heart_data)
This plot shows us at a glance that there is no missing data in the whole dataset.
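The same conclusion can be checked numerically rather than visually. Below is a minimal base R sketch; the data frame `toy` is made up and deliberately contains one `NA` so that the check has something to find.

```r
# Toy data for illustration only; one age value is missing on purpose.
toy <- data.frame(
  age  = c(63, 37, NA, 56),
  chol = c(233, 250, 204, 236)
)

# Number of missing values per column.
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# TRUE only when no column has a missing value.
no_missing <- all(missing_per_column == 0)
print(no_missing)
```

Applied to the real data, `colSums(is.na(Heart_data))` would be expected to return zero for every column if the plot has been read correctly.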
unique_number <- sapply(Heart_data, function(x) length(unique(x)))
unique_values_df <- data.frame("Total Unique Values" = unique_number)
rownames(unique_values_df) <- names(Heart_data)
print(unique_values_df)
## Total.Unique.Values
## age 41
## sex 2
## cp 4
## trtbps 49
## chol 152
## fbs 2
## restecg 3
## thalachh 91
## exng 2
## oldpeak 40
## slp 3
## caa 5
## thall 4
## output 2
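These counts can be used to separate numeric from categorical columns. The sketch below copies the counts from the output above and applies a hypothetical cutoff of 10 distinct values; the cutoff and the names `max_levels`, `numeric_guess`, and `categorical_guess` are assumptions made for illustration.

```r
# Unique-value counts copied from the output above.
unique_counts <- c(
  age = 41, sex = 2, cp = 4, trtbps = 49, chol = 152, fbs = 2, restecg = 3,
  thalachh = 91, exng = 2, oldpeak = 40, slp = 3, caa = 5, thall = 4, output = 2
)

# Hypothetical cutoff: columns with at most this many distinct values are
# treated as categorical. The value 10 is an assumption for illustration.
max_levels <- 10

categorical_guess <- names(unique_counts)[unique_counts <= max_levels]
numeric_guess <- names(unique_counts)[unique_counts > max_levels]

print(numeric_guess)
print(categorical_guess)
```

The counts also show more distinct codes than the definitions list for caa (5 observed, 0-3 documented) and thall (4 observed, three documented), which is worth inspecting before modeling.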
numeric_var <- c("age", "trtbps", "chol", "thalachh", "oldpeak")
categoric_var <- c("sex", "cp", "fbs", "restecg", "exng", "slp", "caa", "thall", "output")
Heart_data %>%
  select(all_of(numeric_var)) %>% # all_of() is the supported way to select by a character vector
  summary()
## age trtbps chol thalachh oldpeak
## Min. :29.00 Min. : 94.0 Min. :126.0 Min. : 71.0 Min. :0.00
## 1st Qu.:47.50 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:133.5 1st Qu.:0.00
## Median :55.00 Median :130.0 Median :240.0 Median :153.0 Median :0.80
## Mean :54.37 Mean :131.6 Mean :246.3 Mean :149.6 Mean :1.04
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5 3rd Qu.:166.0 3rd Qu.:1.60
## Max. :77.00 Max. :200.0 Max. :564.0 Max. :202.0 Max. :6.20
ggplot(Heart_data, aes(x = age)) +
geom_histogram(binwidth = 5, fill = "blue", color = "black") +
ggtitle("Age Distribution") +
xlab("Age") +
ylab("Frequency")
ggplot(Heart_data, aes(x = trtbps)) +
geom_histogram(binwidth = 5, fill = "red", color = "black") +
ggtitle("Resting Blood Pressure (mm Hg) Distribution") +
xlab("Resting blood pressure (mm Hg)") +
ylab("Frequency")
ggplot(Heart_data, aes(x = chol)) +
geom_histogram(binwidth = 5, fill = "purple", color = "black") +
ggtitle("Serum Cholesterol Distribution") +
xlab("Serum cholesterol (mg/dl)") +
ylab("Frequency")
ggplot(Heart_data, aes(x = thalachh)) +
geom_histogram(binwidth = 5, fill = "green", color = "black") +
ggtitle("Maximum Heart Rate Achieved Distribution") +
xlab("Maximum heart rate achieved") +
ylab("Frequency")
ggplot(Heart_data, aes(x = oldpeak)) +
geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
ggtitle("ST Depression (Exercise Relative to Rest) Distribution") +
xlab("ST depression induced by exercise relative to rest") +
ylab("Frequency")
categoric_var
## [1] "sex" "cp" "fbs" "restecg" "exng" "slp" "caa"
## [8] "thall" "output"
heart_data_summary_sex <- Heart_data %>%
count(sex) %>%
mutate(perc = n / sum(n))
ggplot(heart_data_summary_sex, aes(x = "", y = n, fill = factor(sex))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = c("#FF5733", "#33B5FF")) +
labs(title = "sex (Gender)", fill = "sex") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_cp <- Heart_data %>%
count(cp) %>%
mutate(perc = n / sum(n))
colors <- c("#FF5733", "#33B5FF", "#CDDC39", "#9C27B0")
ggplot(heart_data_summary_cp, aes(x = "", y = n, fill = factor(cp))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = colors) +
labs(title = "cp (chest pain type)", fill = "cp") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_fbs <- Heart_data %>%
count(fbs) %>%
mutate(perc = n / sum(n))
colors <- c("#FF5733", "#33B5FF")
ggplot(heart_data_summary_fbs, aes(x = "", y = n, fill = factor(fbs))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = colors) +
labs(title = "fbs (fasting blood sugar > 120 mg/dl)", fill = "fbs") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_restecg <- Heart_data %>%
count(restecg) %>%
mutate(perc = n / sum(n))
colors <- c("#FF5733", "#33B5FF", "#CDDC39")
ggplot(heart_data_summary_restecg, aes(x = "", y = n, fill = factor(restecg))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = colors) +
labs(title = "restecg (resting electrocardiographic results)", fill = "restecg") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_exng <- Heart_data %>%
count(exng) %>%
mutate(perc = n / sum(n))
colors <- c("#FF5733", "#CDDC39")
ggplot(heart_data_summary_exng, aes(x = "", y = n, fill = factor(exng))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = colors) +
labs(title = "exng (exercise induced angina)", fill = "exng") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_slp <- Heart_data %>%
count(slp) %>%
mutate(perc = n / sum(n))
colors <- c("#FF5733", "#CDDC39", "#9C27B0")
ggplot(heart_data_summary_slp, aes(x = "", y = n, fill = factor(slp))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
scale_fill_manual(values = colors) +
labs(title = "slp (the slope of the peak exercise ST segment)", fill = "slp") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_caa <- Heart_data %>%
count(caa) %>%
mutate(perc = n / sum(n))
ggplot(heart_data_summary_caa, aes(x = "", y = n, fill = factor(caa))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
labs(title = "caa (number of major vessels)", fill = "caa") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_thall <- Heart_data %>%
count(thall) %>%
mutate(perc = n / sum(n))
ggplot(heart_data_summary_thall, aes(x = "", y = n, fill = factor(thall))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
labs(title = "thall (Thallium stress test)", fill = "thall") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
heart_data_summary_output <- Heart_data %>%
count(output) %>%
mutate(perc = n / sum(n))
ggplot(heart_data_summary_output, aes(x = "", y = n, fill = factor(output))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
labs(title = "output (diagnosis of heart disease)", fill = "output") +
theme_void() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
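The pie charts above all follow the same pattern, so they could be produced by one function. The sketch below is one possible refactoring, not the code used for the figures above; `summarise_shares()` and `plot_category_pie()` are hypothetical helper names, and the data frame `toy` is made up.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: count each category of one column and compute its share.
summarise_shares <- function(data, column) {
  data %>%
    count(category = factor(.data[[column]])) %>%
    mutate(perc = n / sum(n))
}

# Hypothetical helper: pie chart of the category shares, styled like the plots above.
plot_category_pie <- function(data, column, title) {
  ggplot(summarise_shares(data, column), aes(x = "", y = n, fill = category)) +
    geom_col(width = 1) +
    coord_polar(theta = "y") +
    geom_text(aes(label = scales::percent(perc)), position = position_stack(vjust = 0.5)) +
    labs(title = title, fill = column) +
    theme_void() +
    theme(legend.position = "bottom",
          plot.title = element_text(hjust = 0.5, color = "darkred", size = 15, face = "bold"),
          legend.text = element_text(color = "darkblue", size = 13, face = "bold"))
}

# Toy data for illustration only.
toy <- data.frame(sex = c(1, 1, 1, 0))
shares <- summarise_shares(toy, "sex")
print(shares)

p <- plot_category_pie(toy, "sex", "sex (Gender)")
```

With helpers like these, each chart reduces to a single call such as `plot_category_pie(Heart_data, "cp", "cp (chest pain type)")`.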
thal_zero_rows <- Heart_data %>% filter(thall == 0)
thal_zero_rows
## # A tibble: 2 × 14
## age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 53 0 2 128 216 0 0 115 0 0 2
## 2 52 1 0 128 204 1 1 156 1 1 1
## # ℹ 3 more variables: caa <dbl>, thall <dbl>, output <dbl>
# Replace the undocumented code 0 with 2, the most common value of thall
Heart_data$thall[Heart_data$thall == 0] <- 2
unique_thal_categories <- unique(Heart_data$thall)
unique_thal_categories
## [1] 1 2 3
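The replacement value can also be derived from the data instead of being hard-coded. Below is a minimal base R sketch, assuming that 0 is an invalid code; the vector `thall_toy` is made up and stands in for the real column.

```r
# Made-up codes for illustration; 0 stands in for the undocumented value.
thall_toy <- c(2, 3, 2, 1, 0, 2, 3, 0, 2)

# Frequency of each valid (non-zero) code.
valid_counts <- table(thall_toy[thall_toy != 0])

# Most frequent valid code, converted back from the table's character names.
most_common <- as.numeric(names(valid_counts)[which.max(valid_counts)])

# Replace the invalid code with the most frequent valid one.
thall_toy[thall_toy == 0] <- most_common
print(thall_toy)
```

Computing the mode this way keeps the imputation correct if the data are refreshed and the most common category changes.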
vis_miss(Heart_data)
Heart_data_long <- Heart_data %>%
  pivot_longer(cols = c(trtbps, chol, thalachh), names_to = "variables", values_to = "value") # pivot_longer() supersedes gather()
ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(sex))) +
geom_boxplot() +
labs(title = "trtbps, chol, and thalachh by Sex (Box Plot)",
x = "Variables",
y = "Value",
fill = "Sex")
ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(cp))) +
geom_boxplot() +
labs(title = "trtbps, chol, and thalachh by Chest Pain Type (Box Plot)",
x = "Variables",
y = "Value",
fill = "Cp")
ggplot(Heart_data_long, aes(x = variables, y = value, fill = factor(output))) +
geom_boxplot() +
labs(title = "trtbps, chol, and thalachh by Output (Box Plot)",
x = "Variables",
y = "Value",
fill = "output")
Heart_data_long_2 <- Heart_data %>%
  pivot_longer(cols = age, names_to = "variables", values_to = "value") # pivot_longer() supersedes gather()
ggplot(Heart_data_long_2, aes(x = variables, y = value, fill = factor(output))) +
geom_boxplot() +
labs(title = "Age by Output (Box Plot)",
x = "Variables",
y = "Value",
fill = "output")
Heart_data_long_3 <- Heart_data %>%
  pivot_longer(cols = oldpeak, names_to = "variables", values_to = "value") # pivot_longer() supersedes gather()
ggplot(Heart_data_long_3, aes(x = variables, y = value, fill = factor(output))) +
geom_boxplot() +
labs(title = "oldpeak by Output (Box Plot)",
x = "Variables",
y = "Value",
fill = "output")
cor_mat <- cor(Heart_data)
corrplot(cor_mat, method = "color",
addCoef.col = "black",
tl.cex = 0.8,
number.cex = 0.6,
cl.cex = 0.8,
tl.col = "black",
tl.srt = 45,
order = "hclust" )
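One way to read a correlation matrix like this is to rank the predictors by the absolute value of their correlation with the target. The base R sketch below assumes a numeric data frame with a target column named `output`; the data are simulated for illustration, and the resulting numbers say nothing about the real dataset.

```r
set.seed(42)

# Simulated data: x_strong drives the target, x_noise does not.
n <- 200
x_strong <- rnorm(n)
x_noise <- rnorm(n)
output <- as.numeric(x_strong + rnorm(n, sd = 0.5) > 0)
toy <- data.frame(x_strong, x_noise, output)

cor_mat_toy <- cor(toy)

# Correlation of every predictor with the target, strongest first.
target_cor <- cor_mat_toy[, "output"]
target_cor <- target_cor[names(target_cor) != "output"]
ranked <- sort(abs(target_cor), decreasing = TRUE)
print(ranked)
```

Predictors at the bottom of such a ranking are candidates for removal, although a weak linear correlation alone does not prove that a variable is uninformative.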
#### 4.2.3.1 Analysis Outputs(4)
Heart_data <- Heart_data %>% select(-c(chol, fbs, restecg))
Heart_data
## # A tibble: 303 × 11
## age sex cp trtbps thalachh exng oldpeak slp caa thall output
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63 1 3 145 150 0 2.3 0 0 1 1
## 2 37 1 2 130 187 0 3.5 0 0 2 1
## 3 41 0 1 130 172 0 1.4 2 0 2 1
## 4 56 1 1 120 178 0 0.8 2 0 2 1
## 5 57 0 0 120 163 1 0.6 2 0 2 1
## 6 57 1 0 140 148 0 0.4 1 0 1 1
## 7 56 0 1 140 153 0 1.3 1 0 2 1
## 8 44 1 1 120 173 0 0 2 0 3 1
## 9 52 1 2 172 162 0 0.5 2 0 3 1
## 10 57 1 2 150 174 0 1.6 2 0 2 1
## # ℹ 293 more rows